Linear regressions in R assume a linear relationship between one or more predictor variables \(X_i\) and a response variable \(y\).
\[y_i \sim Normal(\mu_i, \sigma) \\ \mu_i = \alpha + \beta x_i\]
The first element of this model describes how the response variable \(Y\) can be modeled as a normal random variable with mean and standard deviation.
The second element of this model adds the effect of the predictor variable \(X\) on \(Y\), based on the causal relationship \(X \rightarrow Y\). In this simple case, the relationship is linear and can be described by the equation of a line with intercept \(\alpha\) and slope \(\beta\).
In the introductory lecture for linear models, we used an example dataset to fit a simple regression model.
The dataset itself was generated according to the specifications of the model above. We are going to recreate the dataset and some of the calculations presented during the class.
set.seed(4) #ensures the result is reproducible. The random number generation can vary between Operative systems so maybe there will be different datasets in the classroom
N = 30
x <- runif(N, 0, 5)
n <- length(x)
mu <- 1.2 + 3.5 * x
y <- rnorm(N, mean = mu, sd = 3)
lm_data <- data.frame(x, mu, y, res = y - mu)
Create a plot for the dataset
plot(y ~ x, pch = 19)
As you can see, plotting the data suggests a linear relationship between these two variables
We can calculate the correlation coefficients using several methods. The Ordinary Least Squares (OLS) method is the most common, and provides analitical solutions for the coefficients. The slope is the ratio of the covariance between \(x\) and \(y\) and the variance of \(x\). The intercept is the mean of \(y\) minus the product of the slope and the mean of \(x\).
\[\widehat{\beta}= \frac{Cov(x, y)}{Var(y)} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
and
\[ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \]
[1] 3.133357
[1] 3.133357
[1] 3.014409
Add a regression line with the values of the coefficients you calculated
When we generated the dataset (here), \(\mu\) was specified as:
mu <- 1.2 + 3.5 * x_iThis means \(\alpha = 1.2\) and \(\beta = 3.5\). How do these values compare to the coefficients you just calculated?
Add a population line to the plot you previously created by completing the code below:
plot(y ~ x, pch = 19)
abline(a = alpha_hat, b = beta_hat, col = 2)
abline(a = ..., b = ..., col = "blue")See the code by clicking Show code:
Function lm() fits regression models using Ordinary Least Squares.
The values of the coefficients can be found using function coef().
Compare this result with the calculations you did previously.
Before making inferences with your fitted model, it is also a good practice to make some model diagnostics. For linear regression, the most basic diagnostic is to check if the residuals follows a normal distribution with zero mean and a constant standard deviation.
Because the normal distribution has a parameter that corresponds to the expected value, and that is independent of the other parameter, the linear regression model we have been writing as:
\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \sum_{j=0}^{J} \beta_{j} x_{ij} \\ \end{align} \]
Can be also written as:
\[ \begin{align} y_i & = \sum \beta_j x_ij + \epsilon \\ \epsilon & \sim Normal(\mu, \sigma) \\ \end{align} \]
That is, the values of the response \(y_i\) are each a weighted sum of the predictor variables \(x_j\) plus a random Gaussian residual \(\epsilon\). To express the random variation symmetrically around the expected value, this residual has a mean value of zero, and a fixed standard deviation.
To check if this assumption is true, we plot the residuals of the
fitted model as a function of the predicted values. This plot should
show a point cloud of constant width around zero in the Y-axis. You
can check this assumption applying the function plot to the object
that stored the fitted model:
plot(mod, which = 1)
You can also check the normality of the residuals with a qq-plot. Normal data should lie in a straight line in this plot:
plot(mod, which = 2)
What is your overall evaluation of the normality of the residuals?
You can find an excellent explanation of these and other diagnostic plots for linear regression here
In the previous section, we fitted a linear model using the OLS method. Now, we are going to fit the same model using a Bayesian approach.
The Bayesian approach to linear regression is based on the posterior distribution of the parameters \(\alpha\) and \(\beta\). This posterior distribution is proportional to the product of the likelihood of the data given the parameters and the prior distribution of the parameters.
The likelihood of the data given the parameters is the same as the one used in the OLS method. The prior distribution of the parameters is usually a normal distribution with mean zero and a fixed standard deviation.
We will generate a new data set with known parameters to illustrate some differences between the OLS and Bayesian methods.
set.seed(4)
x = runif(50, 50, 100)
y = 1000 + 150 * x + rnorm(50, 0, 1000)
data = data.frame(x, y)
Always visualize your data before fitting a model
plot(y ~ x, data = data)
first, lets fit the model using the OLS method:
mod_ols <- lm(y ~ x, data = data)
# The summary function provides a lot of information about the model,
# including estimates, standard errors, p-values for the coefficients,
# and residual standard deviation (sigma)
summary(mod_ols)
Call:
lm(formula = y ~ x, data = data)
Residuals:
Min 1Q Median 3Q Max
-1756.91 -635.64 -82.71 568.75 1867.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2953.742 684.378 4.316 7.9e-05 ***
x 125.815 8.528 14.754 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 866 on 48 degrees of freedom
Multiple R-squared: 0.8193, Adjusted R-squared: 0.8156
F-statistic: 217.7 on 1 and 48 DF, p-value: < 2.2e-16
Now, let’s fit the model using the rstanarm package, which fits Bayesian models using the Stan language and provides a nice set of standard weekly informative priors.
suppressPackageStartupMessages(library(rstanarm))
mod_bayesian <- stan_glm(y ~ x, data = data)
summary(mod_bayesian)
Let’s compare the coefficient estimates of the two methods.
coef(mod_ols)
coef(mod_bayesian)
library(ggplot2)
# Extract intercept and slope for both models
ols_intercept <- coef(mod_ols)[1]
ols_slope <- coef(mod_ols)[2]
bayesian_intercept <- coef(mod_bayesian)[1]
bayesian_slope <- coef(mod_bayesian)[2]
ggplot(data, aes(x, y)) +
geom_point() +
geom_abline(intercept = ols_intercept, slope = ols_slope, color = "red", linetype = "solid") +
geom_abline(intercept = bayesian_intercept, slope = bayesian_slope, color = "blue", linetype = "dashed") +
theme_bw()
How similar are the estimates of the two methods?
Now, let’s use the rethinking package to fit the model using some standard priors we used in class.
suppressPackageStartupMessages(library(rethinking))
mod_rethinking = ulam(
alist(
y ~ normal(mu, sigma),
mu <- a + b * x,
a ~ normal(0, 1),
b ~ normal(0, 1),
sigma ~ exponential(1)
),
data = data, chains = 4, cores = 4
)
precis(mod_rethinking)
How are the estimates now? Awful!!
What is happening is that the default priors in the rethinking code are too far away from the true values of the parameters. One way to avoid this type of problem is to scale the variables before fitting the model.
data$x_scaled = scale(data$x)
data$y_scaled = scale(data$y)
mod_rethinking_scaled = ulam(
alist(
y_scaled ~ normal(mu, sigma),
mu <- a + b * x_scaled,
a ~ normal(0, 1),
b ~ normal(0, 1),
sigma ~ exponential(1)
),
data = data, chains = 4, cores = 4
)
mod_stanarm_scaled = stan_glm(y_scaled ~ x_scaled, data = data)
mod_ols_scaled = lm(y_scaled ~ x_scaled, data = data)
Now, let’s compare the estimates of the two models.
Scalling variables is a good practice when fitting models with Bayesian methods. It improves computational efficiency and make defining reasonable priors easier.
Move on to the multiple regression lab, where we go over the two examples shown in class.